Mon. Not. R. Astron. Soc. 000, 000-000 (1997) Printed 1 February 2008 (MN MfeK style file vl.4) 



Automated Classification of Stellar Spectra. II: 

Two- Dimensional Classification with Neural Networks and 

Principal Components Analysis 

Coryn A.L. Bailer- Jones 1 *^ Mike Irwin 2 , Ted von Hipper 3 

1 Institute of Astronomy, Madingley Road, Cambridge, CB3 0HA, UK 

2 Royal Greenwich Observatory, Madingley Road, Cambridge, CBS 0EZ, UK 

3 Department of Astronomy, University of Wisconsin, Madison, WI 53706, USA 



Submitted 7 September 1997 



ABSTRACT 

We investigate the application of neural networks to the automation of MK spec- 
tral classification. The data set for this project consists of a set of over 5000 optical 
(3800-5200 A) spectra obtained from objective prism plates from the Michigan Spec- 
tral Survey. These spectra, along with their two-dimensional MK classifications listed 
in the Michigan Henry Draper Catalogue, were used to develop supervised neural 
network classifiers. We show that neural networks can give accurate spectral type 
classifications (ogs = 0.82 subtypes, <7 rms = 1.09 subtypes) across the full range of 
spectral types present in the data set (B2-M7) . We show also that the networks yield 
correct luminosity classes for over 95% of both dwarfs and giants with a high degree 
of confidence. 

Stellar spectra generally contain a large amount of redundant information. We 
investigate the application of Principal Components Analysis (PCA) to the optimal 
compression of spectra. We show that PCA can compress the spectra by a factor 
of over 30 while retaining essentially all of the useful information in the data set. 
Furthermore, it is shown that this compression optimally removes noise and can be 
used to identify unusual spectra. 

This paper is a continuation of the work done by von Hippel et al. (1994) (Paper 

I)- 

Key words: methods: analytical, data analysis, numerical - stars: fundamental pa- 
rameters 



1 INTRODUCTION 

The MK classification of stellar spectra (Morgan, Keenan 
& Kellman 1943; Keenan & McNeil 1976; Morgan, Abt & 
Tapscott 1978) is an important tool in stellar and galactic 
astrophysics. In addition to providing fundamental stellar 
information it was, for example, central to the discovery of 
nearby Galactic spiral arms (Morgan, Sharpless & Oster- 
brock 1952; Morgan, Whitford & Code 1953). 

MK classification is usually performed by a trained ex- 
pert visually matching the overall appearance of a spectrum 
to the 'closest' MK standard spectrum. Such a qualitative 
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method of classification suffers from subjective decisions and 
may differ from person to person: what is deemed as 'close' 
by one person may not be 'close' for another. In addition, 
visual classification is very time consuming, with an expert 
classifying a few 10 stars in a dedicated lifetime. Spectra 
collected from large spectral surveys, often as a by-product 
of other surveys (e.g. the Sloan Digital Sky Survey (Kent 
1994)) will have to be classified by automated means. Thus 
if stellar classification is to continue to be useful to the 
astronomical community, it has to be made faster and put 
on a more quantitative and objective basis. 

In this paper we investigate the application of neural 
networks to the MK classification of optical stellar spec- 
tra. The so-called 'supervised' neural networks used in this 
project are implemented to yield an accurate mapping be- 
tween a data domain (the stellar spectra) and a classifica- 
tion domain (the MK classifications). While visual classi- 



© 1997 RAS 



2 C.A.L. Bailer- J ones et al. 



fiers have mentally determined this mapping, they have not 
quantified it. This mapping is, however, present intrinsically 
in a large set of classified spectra. The neural network's re- 
sultant classification criteria will be essentially equivalent 
to the human's criteria. However, whereas a human's crite- 
ria may vary from adverse physiological and psychological 
factors such as health and mood, the network will retain a 
consistent set of classification criteria. We will also demon- 
strate how the technique of Principal Components Analy- 
sis (PCA) can be used to optimally compress the spectra. 
This has a number of advantages including the preferen- 
tial removal of noise and an ability to isolate bogus spectra. 
Furthermore, using PCA-compressed spectra (rather than 
complete spectra) in the neural network classifiers leads to 
reduced training times and better convergence stability. 

While MK classification will continue to be a useful tool 
to astronomers, it becomes increasingly desirable to obtain 
physical parameters (T e g, logg, etc.) for stars. Bailer- Jones 
et al. (1997b) describe a neural network approach to the 
parametrization of stellar spectra by training a neural net- 
work on synthetic spectra. 



2 PREVIOUS CLASSIFICATION WORK 

There have been a number of attempts in the past to au- 
tomate stellar spectral classification. Kurtz (1982) classified 
low (14 A) resolution spectra using cross-correlation with 
standard spectra and achieved a mean classification error of 
2.2 spectral subtypes for stars in the range BO to M2. The 
same technique gave poor luminosity classification results. 
LaSala (1994) used the related technique of minimum dis- 
tance classification to classify a set of 350 B-star spectra, 
and achieved a mean error of 1.14 spectral subtypes. 

The classification work of von Hippel et al. (1994) (Pa- 
per I) was one of the first applications of neural networks to 
stellar spectral classification. Their neural network solution 
based on a set of 575 spectra gave an RMS classification 
error of 1.7 spectral subtypes (and a 68-percentile error of 
1.4 spectral subtypes) for spectra in the range B3 to M4. 
Gulati et al. (1994) trained a neural network on a set of 55 
spectra giving an incomplete coverage of spectral classes O 
through to M. While they reported classification errors of 2 
subtypes, it should be noted that they used a very complex 
neural network with over 18,000 free parameters (network 
weights), with no justification of why such a complex net- 
work was required. The result is that the determination of 
these weights was likely to be poorly constrained by the 
small amount of training data used. 

There have also been attempts to classify spectra be- 
yond the visual. Weaver & Torres-Dodgen (1995) used neu- 
ral networks to classify infrared spectra (5800 A to 8900 A) 
of A stars at 15 A, and achieved spectral type and luminosity 
class classification precisions of 0.4 subtypes and 0.15 lumi- 
nosity classes respectively. They have recently achieved good 
results in the infrared for a wide-range of spectral types (O- 
M) and luminosity classes (I-V) (Weaver & Torres-Dodgen 
1997). Vieira & Pons (1995) used a neural network trained 
on a set of 64 IUE ultraviolet spectra (150 A to 3200 A) in 
the range 03 to G5, and reported a classification error of 1.1 
spectral subtypes. It was unclear, however, why a network 
with 110,000 weights was required. 



Table 1. The spectral data. 
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Whitney (1983) has examined the use of Principal Com- 
ponents Analysis for spectral classification of a set of 53 
A and F stars. His data set consisted of 47 photoelectric 
measurements of spectra over the wavelength range 3500 A 
to 4000 A. He applied PCA to his data set and then per- 
formed a regression on the three most significant compo- 
nents, achieving an average classification error of 1.6 spectral 
subtypes. 



3 THE SPECTRAL DATA 

The classification techniques described in this paper were de- 
veloped using a set of 5000 spectra taken from the Michigan 
Spectral Survey (Houk 1994). The data reduction method 
is described in Paper I and in more detail in Bailer- Jones 
et al. (1997a). The present work expands the data set of 
Paper I by a factor of ten and doubles the spectral resolu- 
tion. The wavelength range is also slightly different, with the 
details summarized in Table The classification informa- 
tion required to train and test the neural networks is taken 
from the Michigan Henry Draper (MHD) catalogue (Houk & 
Cowley 1975; Houk 1978, 1982; Houk & Smith-Moore 1988). 
In this paper we only examine the automated classification 
of normal stars in terms of their MK spectral type and lu- 
minosity classes. However, the MHD contains considerable 
additional information, particularly with regard to peculiar- 
ities, so this catalogue would be suitable for developing more 
detailed automated classifiers. 

Our set of 5144 spectra contains stars over a wide range 
of spectral types (B2-M7) for luminosity classes III, IV and 
V as well as the 'intermediate' luminosity classes III/IV and 
IV/V. This set, hereafter referred to as data set 'A', was 
used to develop a spectral type classifier. A second data 
set, 'B', contains only 'whole' luminosity classes (i.e. not 
the III/IV and IV/V spectra). This set of 4795 spectra is 
used to develop the luminosity class classifier. The distribu- 
tion of spectral types in this latter set is shown in Figure [I]. 
The spectra were normalized to have equal areas, i.e. equal 
total intensities, thus removing any scale differences result- 
ing from different apparent magnitudes. Line-only spectra 
were obtained for both data sets, using a non-linear rectifi- 
cation method (Bailer- Jones et al. 1997a). In this paper we 
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Table 2. Numerical coding of the MK spectral types. 'SpT' will 
be used to label this code, so that 44 SpT = K2. The MHD 
catalogue omits some classes (e.g. F4 and G7). 




Figure 1. Distribution of spectral types for each luminosity class 
in data set B. The dotted line represent giants (III), the dashed 
line subgiants (IV) and the solid line dwarfs (V). The distribution 
for data set A is very similar. 



investigate automated classification with both line-only and 
line+continuum spectra. 



4 NEURAL NETWORK MODELS 

A neural network is a computational tool which will provide 
a general, non-linear parameterized mapping between a set 
of inputs (such as a stellar spectrum) and one or more out- 
puts (such as a spectral classification). The type of neural 
network architecture used in this paper is known as a multi- 
layer perceptron neural network (e.g. Hertz, Krogh & Palmer 
1991; Bishop 1995; Lahav et al. 1996). In order to give the 
correct input-output mapping, the network is trained on a 
set of representative input-output data. Training proceeds 
by optimising the network parameters (the 'weights') to give 
the minimum classification error. With the weights fixed, the 
network is used to produce outputs (MK classifications) for 
unclassified inputs (stellar spectra), effectively by interpo- 
lating the training data. Note that the output from a neural 
network is some non-linear function of all of the network 
inputs. Thus the network's classifications are based on the 
appearance of the whole spectrum: we do not have to tell 
the network in advance which spectral lines are relevant. 

Network training used the methods of gradient descent 
and backpropagation (Rumelhart, Hinton & Williams 1986). 
Network performance was found to be insensitive to the ex- 
act vaules of the 'gain' and 'momentum' parameters. It was 
determined that 1000 training iterations were sufficient to 
ensure that the network error, as evaluated on an indepen- 
dent test data set, had reached its minimum error. Up to 50 
times as many iterations gave only negligible improvement. 
Training a neural network on 2500 spectra represented as 
50 PCA coefficients typically took about an hour. The ap- 
plication of these trained networks to then classify a similar 
number of spectra is a few seconds. Further details can be 
found in Bailer- Jones (1996). 

Spectral type classification was performed by represent- 
ing the 57 MK classifications in the MHD as points on a 
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continuous scale of numbers 1-57 (Table |2j). This is reason- 
able as we know that spectral type is closely related to effec- 
tive temperature (T a s ) and MK spectral types are essentially 
binnings of a continuous sequence. The appropriate neural 
network therefore had a single output giving a continuous 
number in the range 1-57. We shall refer to this as continu- 
ous mode. Note that although the network is trained on in- 
teger or half-integer values, it can produce any real-value 
classification for new spectra. 

For the luminosity class problem, we used a network 
in probabilistic mode. This refers to a network with several 
outputs, each output referring to a mutually exclusive class. 
In our case we had three outputs, with one output corre- 
sponding to each of classes III, IV and V. In this mode, the 
values in each node can be interpreted as the probability that 
the input spectrum is of that particular luminosity classes. 
Probabilistic interpretation of neural networks in both prob- 
abilistic mode (also referred to in the neural network liter- 
ature as 'classification') and continuous mode (also referred 
to as 'regression') can be taken much further. In particular, 
the outputs can be interpreted as Bayesian posterior prob- 
abilities (e.g. Richard & Lippmann 1991; MacKay 1995). 



4.1 Committee of Networks 

Identical neural networks trained from different initial ran- 
dom weights should ideally converge on the same weights 
and hence produce identical input-output mappings. How- 
ever, given the high dimensionality and complexity of the 
error surface which is explored during training, it is unlikely 
that numerical minimization procedures with different ini- 
tializations would converge on exactly the same final weight 
vector. 

We can reduce the effects of this 'convergence noise' 
problem by using a committee of neural networks. The com- 
mittee consists of L identical networks which are separately 
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trained from different initial weights. When the network is 
used in continuous mode, the committee classification of the 
p t spectrum, C p , is just the average of the individual net- 
work classifications. In probabilisitic mode, the outputs from 
each network corresponding to a given class are summed 
to give the (unnormalized) committee probability of that 
class. In some applications, C p will be a more accurate clas- 
sification than any of the individual network classifications 
(Bishop 1995). All classification results presented in this pa- 
per were obtained with a committee of ten neural networks. 



4.2 Neural Network Error Measures 

The performance of a trained neural network (or committee 
thereof) is evaluated by comparing its classifications of an 
independent set of spectra with their 'true' classifications 
in the MHD catalogue. We must not evaluate the perfor- 
mance of the neural network using the data on which it was 
trained. This is because it is possible for the network to 
overfit the training data rather than capture the underlying 
input-output mapping which the training data represent. 
This can occur if there is either insufficient data to constrain 
the determination of the network weights, or if the training 
data is not representative of the problem in hand. Such an 
overfitted network would typically produce very low classi- 
fication errors on the training data yet produce relatively 
large errors on an independent test data set. Our procedure 
was therefore to train a network on half of the spectra and 
test its performance on the other half, there being approxi- 
mately 2500 spectra in each half. 

We use the following error measures to evaluate the 
performance of our networks. The first is the RMS error, 
a- rms , of the difference between the network classifications 
and the 'true' classifications. This statistic suffers from the 
usual problem that it is sensitive to outliers and may not, 
therefore, be very characteristic of the majority of residuals 
in the core of the distribution. A more robust measure uses 
only the central 68% of the residuals, o"68- If the residuals are 
distributed as a Gaussian, a^s is the la standard deviation 
of a Gaussian distribution. Both <T rms and a^g are external 
errors, because they are measured with respect to a set of 
ideal classifications which are external to the neural network. 

Due to the 'convergence noise' problem (section |f.l| ), a 
network re-trained from different initial weights would give 
slightly different classifications. This level of difference is 
characterised by the internal error and is evaluated using 
the committee of networks. The internal error for a single 
spectrum is: 



\ 



l — L, 



(1) 



C\ is the classification given by the i* network in the com- 
mittee. The total internal error, <7i n t, is a? nt averaged over 
all spectra, and can be considered as the contribution to 
the total (external) error on account of imperfect network 
convergence. 

An additional consequence of the 'convergence noise' is 
that a single value of the external error, cres, is not exact. 
Thus if we wish to compare values of oes produced by differ- 
ent network models, then we need to know the uncertainty 
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Figure 2. Average spectrum for line+continuum version of spec- 
tra in data set A. 



in <T68 in order to known whether the difference between two 
values of a^s is statistically significant. A suitable measure 
of this uncertainty is given by the standard error in the a of 
a Gaussian distribution, 

C68 



(2) 



where M is the number of spectra in the test set. This result 
holds exactly in the limit as M — > oo. 



5 PRINCIPAL COMPONENTS ANALYSIS 

It is usually desirable - and often essential - to reduce the 
dimensionality of a data set prior to classification. Dimen- 
sionality reduction often leads to enhanced reliability when 
using neural networks to give a generalized mapping, on ac- 
count of the reduced number of parameters in the network. 
This will also lead to greatly reduced training times. While 
classification with complete spectra can produce good re- 
sults (Paper I), dimensionality reduction may be essential 
in some applications, such as when data transmission rates 
from space-based observatories are limited (e.g. Lindegren 

6 Perryman 1996). 

Principal Components Analysis (PCA) is one method 
for achieving a dimensionality reduction. PCA is a method 
of representing a set of A^-dimensional data by means of 
their projections onto a set of r optimally defined axes. As 
these axes (the principal components) form an orthogonal 
set, PCA yields a linear transformation of the data. A com- 
pression of the data is obtained by ignoring those compo- 
nents which represent the least variance in the data. The 
compressed spectra, as represented by their projections, onto 
the most significant principal components, are then used as 
the neural network inputs. In this section we will demon- 
strate the benefits of a PCA preprocessing of spectra, such 
as noise removal and identification of bogus spectra, and 
highlight some of its problems. In the next section we shall 
demonstrate that high quality classifications can be achieved 
with these compressed spectra. 

Below we discuss the application of PCA to the 
line+continuum spectra of data set A. A separate analysis 
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was carried out for the line-only spectra (Bailer- Jones 1996). 
PCA has been used in several areas of astronomy, including 
stellar spectral classification (Deeming 1964; Whitney 1983; 
Storrie-Lombardi et al. 1994), galaxy spectral classification 
(Folkes, Lahav & Maddox 1996) and quasar spectral classifi- 
cation (Francis et al. 1992). Further details of the technique 
can be found in texts (e.g. Murtagh & Heck 1987). 



in Figure pi No single coefficient shows a strong correlation 
across the full range of subtypes, so classification cannot be 
achieved using any one coefficient. Some coefficients do, how- 
ever, show correlations over part of the spectral range. We 
saw in Figure ^ that the 7 th eigenvector represents some fea- 
tures of late- type stars and we see in Figure [| that the corre- 
sponding admixture coefficient, shows a trend with spectral 
type for SpT > 48. 



5.1 The Principal Components 

PCA was performed on zero-mean spectra: the subtracted 
average spectrum, x, is shown in Figure |^. Figure ^ shows 
the first ten principal components, u, plotted on a common 
vertical scale. Because the principal components are eigen- 
vectors of a symmetric matrix, they are orthogonal, and it 
is convenient also to normalize them to unit length, so that 
uf .Uj — Sij. As the eigenvectors are simply rotations in the 
iV-dimensional data space of the original axes on which the 
spectra are defined, they resemble spectra, in particular in 
that they have the same number of elements (820) as the 
original spectra. 

It is interesting that both the stellar continuum and the 
individual spectral lines are distributed across many eigen- 
vectors. For example, the Ca II H&K lines at 3934 A and 
3969 A are distinct in the first four eigenvectors as well as 
the average spectrum. (Note that the sign of the eigenvec- 
tors is arbitrary, as the admixture coefficients - the projec- 
tions of the spectra onto the principal components - can be 
negative.) That the features do not separate into different 
components is not surprising: We known from the physics 
of line formation in stellar photospheres that a spectrum is 
not a linear combination of spectral features, so we should 
not expect a linear decomposition of the spectrum (such 
as PCA) to clearly isolate these spectral features. Features 
are generally distributed across many components, e.g. com- 
ponents 5 and 8 which show many lines common to a wide 
range of spectral types. However, some features are predom- 
inantly represented by a single components. For example, it 
can be seen that the TiO bands (which extend redward from 
about 4500 A) characteristic of M stars are more strongly 
represented in the 7 th principal component than any other 
component. 

The principal components represent sources of variance 
in the data. Thus the most significant principal components 
show those features which vary the most between the spec- 
tra: it is important to realise that the principal components 
do not simply represent strong features. Note also that the 
eigenvectors obtained from PCA are entirely dependent on 
the data. Therefore the eigenvectors for a different set of 
stellar spectra are likely to be rather different. 



5.2 The Admixture Coefficients 

The projection of the p th spectrum onto the k th principal 
components is known as the admixture coefficient, dfc, p . Be- 
cause PCA is only a linear transformation of the spectra, one 
would not expect there to be a strong correlation between 
the stellar classification parameters and the admixture co- 
efficients. Figure ^| shows the admixture coefficients for the 
line+continuum spectra plotted against the coded MK spec- 
tral type, SpT, for each of the first ten eigenvectors shown 



5.3 PCA as Data Compression and Noise Filter 

The most significant principal components contain those fea- 
tures which are most strongly correlated in many of the 
spectra. It follows that noise - which is uncorrelated with 
any other features by definition - will be represented in the 
less significant components. Thus by retaining only the more 
significant components to represent the spectra we achieve a 
data compression that preferentially removes noise. The re- 
duced reconstruction, y p , of the p spectrum x p , is obtained 
by using only the first r principal components to reconstruct 
the spectrum, i.e. 

k=r 

y P = x + ^ a fe , p u fc , r<N. (3) 

Let e p be the error incurred in using this reduced recon- 
struction. By definition, x p = y p + e p , so 



k=N 

s P = ^ afc iP Ufc 

fc = r + l 



(4) 



Averaging over all spectra gives rise to the average error, £, 
from which we define the figure-of-merit of the reconstruc- 
tion quality of the whole data set as R — 1 — £, and 



R = 100% 



(5) 



where Xk is the fc* eigenvalue of the covariance matrix, 
S, of the data (Bailer- Jones 1996). Figure |^a shows how R 
varies with the number of eigenvectors used to reconstruct 
the spectra, and Table ^ tabulates some of these values. We 
see that only 25 eigenvectors (~ 3% of the total) are suffi- 
cient to reconstruct 95.8% of the variance in the data. This 
large factor of data compression is a great benefit for any 
approach to the classification problem as it corresponds to 
a large reduction of the dimensionality of the space required 
to describe the data. 

It is convenient to define an empirical measure of the 
reconstruction error for individual spectra 



E 



100% \ - , 

g / { \Vi,V X i,p\ 



(6) 



where S is the total area under each spectrum, which was 
fixed to a constant value when the spectra were area nor- 
malized. This measure is useful as it does not require the 
existence of eigenvalues for its evaluation, and hence can 
be used to compare reconstruction errors between any data 
compression techniques. The frequency distribution of these 
errors is shown in Figure |^. 

Figure ^ gives a visual presentation of spectral recon- 
struction by showing an M star spectrum reconstructed with 
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Figure 3. The first ten principal components of the line+continuum spectra in data set A. The principal 
components are normalized eigenvectors plotted against wavelength. A number of spectral features can be 
seen. In particular, the 7 th eigenvector strongly represents the TiO bands of late-type stars. 



Figure 4. (This figure is provided as a separate GIF file.) The first ten principal component admixture 
coefficients for all spectra (line+continuum format) in data set A plotted against spectral subtype (SpT). 
The correlation between the 7 th coefficient and spectral subtype for SpT > 48 (M stars) is accountable by 
reference to Figure ^ where we see that the 7 th eigenvector gives a strong representation of the TiO features 
in M stars. In general, however, the admixture coefficients cannot be used individually to give spectral type 
classifications. 



an increasing number of components. We have seen in Fig- 
ure ^ that the 7 th eigenvector is representative of the TiO 
bands in late type stars. In reconstructing this M star spec- 
trum we see that the reconstruction error drops significantly 



when the 7* component is added, and that visually this 
r = 7 spectrum is greatly improved over the r = 6 one. 

An optimal trade-off between compression (and noise 
removal) and accurate spectral representation is achieved 
for r = n when ^ ~ const. If a PCA were performed on a 
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Figure 5. Quality of spectral reconstruction with a reduced num- 
ber of eigenvectors, a) The quality of reconstruction increases 
dramatically over the first 25 or so eigenvectors, b) At n Ri 25, 
4S ?s 0, and the remaining eigenvectors are predominantly noise. 
Thus an optimal reconstruction only requires approximately the 
first 25 eigenvectors. R is the reconstruction error predicted by 
the eigenvalues (equation ^) and so is the reconstruction error for 
the whole data set. 



Table 3. Selected values from Figure 
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87.139 


2.255 


-1.789 


5 


88.989 


1.850 


-0.405 


10 


93.861 


0.570 


-0.119 


15 


94.930 


0.174 


-0.004 


20 


95.458 


0.074 


-0.012 


25 


95.760 


0.052 


-0.005 


30 


95.972 


0.036 


-0.002 


35 


96.121 


0.027 


-0.001 


40 


96.244 


0.023 


-0.001 


45 


96.352 


0.021 


-0.000 


50 


96.449 


0.018 


-0.001 


820 


100.000 


0.000 


0.000 



data set of pure noise, no component would be a greater dis- 
criminant than another, and their ranking would be random 
giving = const for all n. Turning this argument around, 
the point where the R-n plot levels off (4^ w const) is where 
the components begin to be dominated by noise. This occurs 
between n = 20 and n — 30 (Figure |B|b). Figure ^| compares 
a faint spectrum reconstructed with 25 components with the 
original spectrum. The reconstructed spectrum is consider- 
ably less noisy and the residual spectrum contains no major 
features. Note that if we had lower S/N data, 4^ would 
turn constant at a smaller value of N. Thus for lower S/N 
data, the optimality criterion translates into retaining fewer 
components. 
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Reconstruction error, I 1 

Figure 6. Frequency distribution of the empirical reconstruc- 
tion error defined by equation ^. The solid line shows the er- 
rors for a 50-component reconstruction and the dashed line for 
a 25-component reconstruction, (a) histogram of the reconstruc- 
tion errors, (b) the cumulative distribution of (a). This shows 
the fraction of spectra which are reconstructed with an error less 
than that shown on the horizontal axis. For example, 95% of the 
spectra have E < 4.6%. 

One of the drawbacks of PCA is that very weak spectral 
features or features which are only present in a small fraction 
of the data will be lost in a reduced reconstruction. This is 
because they show very little correlation across the data 
set. Thus the residual spectrum will contain, in addition 
to noise, some weak features which are not well correlated 
across the spectra. However, as can be seen from Figure ^ 
by no means are all such features are lost. Thus we see that 
the principal components (like neural network classifiers) are 
sensitive to the relative frequency of occurrence of features 
in the data set. An advantage of this is that PCA can be 
used to filter out bogus features (e.g. plate scratches, strong 
cosmic rays) because such features are rare and randomly 
positioned. This is demonstrated in Figure ^[ 

5.4 Constructing New Spectra 

Having performed PCA on a set of stellar spectra (the con- 
struction set), we may want to find the admixture coeffi- 
cients for a new set of data, such as more recently acquired 
spectra. We do not have to re-evaluate the principal compo- 
nents using the combined data sets: instead we can project 



© 1997 RAS, MNRAS 000, 000-000 



8 C.A.L. Bailer- J ones et al. 




Figure 7. Reconstruction of an M star (HD 14002, type M3/4 
III, magnitude 9.4). The figures in bold refer to the number of 
eigenvectors used to reconstruct the spectrum. The percentage 
figures are the corresponding reconstruction error, E. Note the 
improved reconstruction as the 7 th component is added. 



the new spectra onto the old components. We can even ob- 
tain the admixture coefficients for incomplete spectra, thus 
permitting a complete reconstruction. This would be useful 
if we wanted to apply the neural network classifier to new 
spectra with a slightly different wavelength coverage. 

Another advantage of PCA is that the reconstruction 
error, E, can be used to filter-out bogus and non-stellar spec- 
tra, as in such cases the reconstruction error would be lar ger 
than the typical values in Figure ^j. As an example, Figure |Io| 
shows how this filtering works by attempting to reconstruct 
sky noise. The 25-component reconstruction gives an error 
of 28%, which is several times larger than the maximum 
reconstruction error of the spectra in the construction set. 

This method of rejection assumes that the data used 
to define the principal components are representative of the 
stellar spectra which we want to classify. Thus rare types of 
stars with strong features, e.g. Me stars, would be filtered 
out along with all the bogus spectra. Note that a neural 
network applied to either the complete spectra or the PCA 
compressed spectra suffers from a similar problem: spectral 
types which are relatively rare in the training data set will be 
poorly classified. But with PCA there is no reason why such 
objects have to be blindly rejected: A large reconstruction 
error indicates an unknown object, so PCA could be used as 




4000 4500 5000 



Figure 8. Reconstruction of a faint G star (HD 219795, G3 V, 
magnitude 11.1). The reconstruction error is E = 4.26%. The top 
spectrum is the original spectrum, the middle is the spectrum 
reconstructed with 25 components and the bottom is the resid- 
ual spectrum, i.e. reconstructed minus original. The arrow marks 
the location of a weak luminosity sensitive line Sr II which is re- 
tained in the reconstruction. This spectrum is one of the faintest 
(and hence noisiest) in the data set, so most reconstructions are 
considerably better. 
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Figure 9. The use of PCA in filtering out bogus features. (See 
caption to Figure ^|) The contaminating feature here is probably 
due to a piece of dust on the plate during plate scanning. The 
star is HD 3391, type F5 V, magnitude 10.8. The reconstruction 
error is E = 5.30%. 
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Figure 10. Bogus spectra (in this case a patch of sky) are recon- 
structed with a large error, which can be used to filter them out 
of the data set prior to classification. The figures in bold refer to 
the number of eigenvectors used to reconstruct the spectrum. The 
percentage figures are the corresponding reconstruction error, E. 



a coarse front-end classifier to a more refined classification 
system. 



6 SPECTRAL TYPE RESULTS 

Neural networks were applied to the spectral type problem 
in continuous mode (see section |^). Each spectrum is rep- 
resented by the admixture coefficients of the first 25 or 50 
principal components. Data set A is used throughout. 

Figure |ll] shows classification results using a com- 
mittee of ten networks applied to the 50-component 
line+continuum PCA spectra. Each network has a 50:5: f 
architecture, indicating 50 nodes in the input layer, 5 nodes 
in the hidden layer and a single output (numbers exclude 
bias nodes). The average classification error is <768 = 1-07 
SpT, which has an associated uncertainty of e = 0.02 SpT, 
i.e. <j 68 = 1.07 ±0.02 SpT. cr rms = 1.41 SpT, indicating that 
the tails of the distribution of the residuals are 'heavier' than 
we would get if the distribution were Gaussian. 

The optimality criterion for reconstructing spectra (sec- 



tion 



5.3) specified that 25 principal components would give 
an optimal reconstrutcion. However, this figure was achieved 
without any reference to how we would subsequently use the 
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Figure 11. Spectral type classification results from a committee 
of ten 50:5:1 neural networks applied to linc+continuum spec- 
tra. The left-hand panel is a plot of the committee classifications 
against the 'true' classifications listed in the MHD catalogue. The 
diagonal line is the locus of points for which the committee classi- 
fications equal the catalogue classifications, and is drawn to guide 
the eye. Note that even a perfect classifier would not give results 
exactly on this line on account of noise in the spectra and un- 
certainty in the 'true' classifications. The right-hand panel is a 
histogram of the classification residuals, C p — G p , where C p is 
the committee classification and G p the catalogue classification 
of the p th spectrum. The classification error is CT68 = 1.07 SpT. 
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Figure 12. Variation of network classification (external) error, as 
a function of number of PCA inputs to the neural network. The 
solid line is a^S and the dotted line is cr rms . The number of inputs, 
r, is the number of principal components used to represent each 
spectrum. The error bars on the lower curve are 3 X e errors (see 
equation ^). Statistically speaking, there is a significant drop in 
the classification error from r = 1 to r w 25, followed by a barely 
significant decrease in error to r = 50. 



spectra. A direct evaluation of the number of principal com- 
ponents required is shown in Figure ^| which summarizes 
the performance of networks with an r:5:l architecture for 
a range of r. The behaviour is well anti-correlated with the 
behaviour of the reconstruction quality, R, in Figure This 
is what we would expect and demonstrates that the network 
is making best use of the information given to it. 

As the number of hidden nodes in the network increases, 
so does its ability to accurately model increasingly complex 
input-output functions. Figure |l^ shows how the classifi- 
cation errors vary with the number of hidden nodes These 
results show that as q is increased to about seven, there is 
an improvement in classification performance, but beyond 
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Figure 14. Best spectral type results (c68 = 0.82 SpT), obtained 
with a committee of ten 25:5:5:1 networks with line-only spectra. 



Figure 13. Variation of the network classification (external) error 
as a function of the number of hidden nodes, q, in a 50:q:l neural 
network. For each value of q a single neural network was trained 
using line+continuum spectra and its performance assessed by 
measuring crag (solid line) and cr rms (dashed line). The error bars 
are 3 X e, where e is the statistical uncertainty in CT68- 



Table 5. Spectral type classification errors for each member a 
committee (25:5:5:1 networks on line-only spectra). The commit- 
tee error is lower than the error achieved by any one of its mem- 
bers acting alone. The dispersion of results for individual network 
is typical of the other committees used in the spectral type prob- 
lem. 



this there is no statistically significant improvement. (Note 
that the neural network with only one hidden node can only 
linearly discriminate between spectral types, which explains 
the sharp decrease in classification error between q — 1 and 
q = 2.) A network with q = 250, gave <t 68 = 1.00 ±0.01 SpT, 
which is a small increase over q ~ 5. While it is theoretically 
true that a sufficiently large number of hidden nodes will in- 
crease performance (Hornick, Stinchcombe & White 1990), 
the number of hidden nodes required would be inhibitively 
large, both on account of training time and over-fitting the 
data. The solution is to use an additional hidden layer. This 
can provide increased complexity with a smaller total num- 
ber of weights. A committee of ten 50:5:5:1 neural networks 
gave a classification error of aes ~ 0.88 ± 0.01 SpT, which 
is significantly better than the results with only one hidden 
layer. 

Table | is a summary of the spectral type classifica- 
tion results obtained with a range of network architectures. 
The most significant result is that smaller external errors 
(<T68 an d 0"rms) are obtained with two hidden layers com- 
pared with one hidden layer. Interestingly, the internal error 
is smaller when the total number of weights is smaller, pre- 
sumably because the minimum of the error function is easier 
to locate consistently when the dimensionality of the space 
is lower. This is one reason for using PCA to compress the 
spectra, as it results in a network with fewer weights. 

Our results also show a very small improvememnt in 
classification performance when using line-only spectra. This 
is perhaps to be expected as the continuum is more contami- 
nated by effects such as interstellar reddening and non-linear 
photographic response. It is interesting that von Hippel et al. 
(1994) (Paper I) obtained worse results with line-only spec- 
tra. This is probably because they used spectra at half the 
resolution, resulting in the loss of some line-information 
(whereas the continuum would hardly be affected). 

The best results are plotted in Figure [uj Table ^ lists 
the <T68 error achieved by each neural network in this com- 
mittee and confirms empirically that the error obtained by 
the ten networks acting as a committee is lower than any 



Network 


C68 


ff rms 


1 


0.90 


1.21 


2 


0.86 


1.23 


3 


0.87 


1.09 


4 


0.89 


1.27 


5 


0.93 


1.24 


6 


0.91 


1.21 


7 


0.87 


1.17 


8 


0.95 


1.31 


9 


0.89 


1.13 


10 


0.87 


1.17 


Committee 


0.82 


1.09 


Tint 


0.36 





one of the networks acting alone. Figure |l5| shows graphi- 
cally the degree of reproducibility of results. The committee 
result of crgg = 0.82 compares quite favourably with the 
error in the catalogue classifications themselves, estimated 
to be ergs = 0.63 SpT (N. Houk, private communication, 
1995), which represents a lower limit on the precision we 
can achieve. Larger networks did not improve performance 
further. 

To prove that the PCA was not limiting the perfor- 
mance of the neural network classifiers, we trained a com- 
mittee of neural networks on the original (non-PCA) 820- 
bin spectra. The mean classification error of o"68 = 0.82 
SpT shown in Table ^ is no better than the PCA-input 
results for comparable numbers of hidden nodes, confirm- 
ing that the PCA compression has not resulted in the loss 
of any classification-significant information. Strictly speak- 
ing, this 820:5:5:1 network has too many weights to be well- 
determined by the data (4141 weights (unknowns) vs. 2500 
spectra (equations)), so it may be surprising that such a net- 
work can generalize. However, due to correlations between 
the spectral features, some input weights will be correlated, 
effectively reducing the number of parameters which must 
be determined by the data. Indeed, the Principal Compo- 
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Table 4. Summary of the spectral type classification results using committees of ten net work s 
for a variety of architectures. The error measures in columns 3-6 are defined in section 4.2. 
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25:3:1 


line+continuum 
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±0.015 
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line+continuum 


1.11 


±0.015 


1.46 


0.18 


136 
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±0.015 
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261 
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0.86 


±0.012 
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166 
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±0.011 
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Figure 15. (This figure is provided as a separate GIF file.) Performance of the first five members of the 
committee networks used to produce the results in Figure 14l The left-hand figures show how the neural 
network error drops with increasing iteration number. The dashed line shows the error on the training set 
and the solid line the error on the test set. Note that the network error, as measured on the test set during 
training, briefly increases early on in the training for the first and third networks. This demonstrates that 
we should not stop training the instant that the error on the test set rises. It is also interesting that most of 
the training takes place in the first few iterations. 
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using are averages over all spectral types. Figure |l^ shows 
that <T68 and (Tint vary considerably as a function of spec- 
tral type. This is influenced by the frequency distribution of 
spectral types: As we can see from comparison with Figure [l], 
where there are relatively few spectra in the training set the 
classification errors are correspondingly higher. This is be- 
cause the neural network has been presented with relatively 
little information about these regions, and the few spectra 
are unlikely to give adequate information on the intra-class 
variability. Indeed, if we remove the few spectra at the earli- 
est and latest spectral types, our overall error drops towards 
the limit imposed by the training data. 

We have also experimented using neural networks 
in probabilistic mode for spectral type classification. A 
committee of ten such 50:5:5:57 networks applied to the 
line+continuum spectra gave <7 rms = 2.09 SpT, which is 
somewhat inferior to continuous output results. However, 
the probabilistic approach does offer some advantages, such 
as the ability to recognise composite spectra (Weaver 1994). 



Figure 16. Neural network classification errors as a function of 
spectral type. The solid line is the external error, <768> an d the 
dotted line the internal error, o- ln t. The error bars on the ex- 
ternal error histogram are 3 X e errors for each spectral type 
bin, showing that the differences are significant. These results 
are from the committee of ten 25:5:5:1 neural networks applied 
to line+continuum spectra shown in Table W 



nents Analysis showed that the effective dimensionality of 
the spectra is only about 25. 

The internal and external error measures we have been 



7 LUMINOSITY CLASS RESULTS 

Neural networks were applied to the luminosity class prob- 
lem in probabilistic mode (see section ^) using data set B. 
The spectrum is classified as that class for which the output 
is highest. We only consider two-hidden layer networks. 

The measure of network performance when we have a 
few discrete classes is by means of the confusion matrix. This 
reports the fraction of spectra which have been correctly and 
incorrectly classified for each class. Table [] compares the re- 
sults from different committees of networks, with the four 
combinations of line-only or line+continuum spectra rep- 
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Table 6. Confusion matrices for four different committees of ten networks. Each confusion matrix lists the 
percentage of spectra which have been classified correctly and incorrectly. Thus from the top-left matrix we 
see that the committee correctly classifies 97.7% of class Vs as class Vs, but incorrectly classifies 1.5% of 
spectra which are class V in the catalogue as class III. The rows of each matrix sum to 100%. 
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resented with 25 or 50 principal components. We see that 
there is little difference in performance between any of these 
combinations. This is in agreement with the spectral type 
classifications and confirms that most of the luminosity class 
information is contained within the first 25 admixture coef- 
ficients. The networks give very good results for classes III 
and V (but not class IV), although the better results for 
class V than class III may be due to larger fraction of class 
Vs in the data set (1.6 times as many). 

Figure ^ shows the distribution of the probabilities 
which the committee assigns for each class. While most of 
the class III and V objects are correctly classified with large 
confidence, the opposite is true for class IVs. The nework is 
not classifying IVs at random (otherwise we would expect 
it to classify about 33% correct). Rather, the networks have 
a preference for classifying IVs as either Ills or Vs. While 
the relative paucity of class IVs in the training set will have 
some influence, they are not so rare to give such poor per- 
formance. Nor are the IVs lower quality spectra. 

Referring back to Figure ^ we see that there is a fairly 
strong correlation between spectral type and luminosity 
class. Is the network using spectral type information to pro- 
duce luminosity classifications? It would do quite well if it 
simply classified all spectra later than about K0 as giants 
and the rest as dwarfs. (Note that much of this correlation is 
real, because the HR diagram is not uniformly populated.) 

To find out what spectral information the networks are 
using, an 'overlap' data set was created by selecting spec- 
tra of classes III, IV and V in roughly equal numbers for 
that range of spectral types where their frequency distribu- 
tions overlap (around G6). These spectra were then classi- 
fied using a committee previously trained on all the data. 
We see from Table ^ that the committee still yields good 
classifications of classes III and V, for which it cannot be 
using spectral type information as there is no correlation 



Table 7. An 'overlap' data set is classified using the committee 
of ten 50:5:5:3 networks trained on line+continuum spectra. This 
overlap data set consists of those spectral classes for which the 
three luminosity classes are equally represented, which is for spec- 
tral types G5/G6, G6, G6/G8 and G8. This data set consists of 
86 Ills, 83 IVs and 68 Vs. The committee still classifies Ills and 
Vs well and still fails on IVs (cf. Tabic |). 

50:5:5:3, line+cont. spectra, G6-G8 spectra only 
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between spectral type and luminosity class over this specif- 
ically chosen and narrow spectral range. There must, there- 
fore, be independent luminosity information present in this 
50-component reconstruction of the stellar spectra. The class 
IV classifications are still poor. 

The failure on class IV spectra implies that, at the res- 
olution of these spectra, class IV stars are not spectroscop- 
ically distinct from either class III or class V. Certainly, vi- 
sual classifiers find it hard to distinguish class IVs from Ills 
and Vs around late G-type stars (N. Houk, private commu- 
nication, 1996). It cannot be due to the PCA compression 
as complete spectrum classification gives almost exactly the 
same results as shown in Figure ^. Problems with the data 
reduction, e.g. imperfectly registering the spectra in wave- 
length, could also contribute. An alternative explanation is 
as follows. Visual classifiers can focuse on certain lines in a 
spectrum and disregard all others. In principal, neural net- 
works can do this too by altering their weights. However, 
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Figure 17. Distribution of class probabilities assigned by the three outputs of a committee of ten 50:5:5:3 
networks trained on line+continuum spectra. Each plot is a histogram of the committee probabilities for 
a certain network class (column) and a certain catalogue class (row). For example, the bottom- left hand 
diagram shows the distribution of the committee probabilities from the class III output node for spectra 
which are catalogue class V, and shows that most catalogue class V objects have been assigned a low 
probability of being class III. Each plot in a given row includes the same spectra, and this number has been 
used to normalize the fractions for that row. The total numbers of Ills, IVs and Vs in the test data set are 
848 Ills, 185 IVs and 1364 Vs, with almost exactly the same numbers in the training set. Due to the nature 
of the sigmoid function in the neural network we can never achieve 100% confidence. 



when there is noise in the spectrum some inputs will show 
random correlations with the target outputs. Thus the net- 
work will make a small level of false inference about the rel- 
evance of certain inputs in determining the outputs. With 
class IV discrimination the relevance of the few truly im- 
portant features may have been washed out this false 'noise 
association'. A solution to this problem is to use prior knowl- 
edge of which lines are relevant and train the network only 
on those features. Another approach is automatic relevance 



determination (MacKay 1995), which is a Bayesian tech- 
nique for assessing the relevance of the inputs using the ev- 
idence in the data. 

We attempted to use networks with two continuous out- 
puts to tackle the spectral type and luminosity class prob- 
lems simulataneously. However, the results were inferior, 
with the best results being a@s = 1-53 SpT (a rms = 2.02 
SpT) for the spectral type and a^s = 0.15 (a rms = 0.4) lumi- 
nosity classes (Bailer- Jones 1996). Due to the spectral type- 
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luminosity class correlation in the data set, the network may 
be unable to adequately separate out luminosity effects from 
temperature ones. This is not helped by the weakness of the 
luminosity distiguishing features in this wavelength region. 
In order to tackle both problems simultaneously, we may 
need a more complex model, and such complexity may not 
be available with modest-sized networks. 



8 SUMMARY 

We have produced a system for the automated two- 
parameter classification of stellar spectra over a wide range 
of spectral types (B2-M7) based on a large (> 5000), ho- 
mogenous set of spectra. We have shown that we can achieve 
classification errors of a$s = 0.82 subtypes (cr rms = 1.09 sub- 
types) over this complete range of spectral subtypes. This 
result compares favourably with the intrinsic errors of o"68 = 
0.63 subtypes in our training data. Once a neural network 
has been trained, its classification results are completely re- 
producible. Moreover, the low values of their internal errors 
(< 0.4 spectral subtypes) demonstrate that networks can be 
re-trained to give sufficiently consistent classifications. 

We have achieved correct luminosity class classification 
for over 95% of dwarfs (class V) and giants (class III). Re- 
sults for luminosity class IV spectra were considerably worse. 
It is believed that the data themselves could be a limiting 
factor and methods for improving these results were dis- 
cussed. Despite the correlation in the data set between spec- 
tral type and luminosity class, it was demonstrated that the 
neural networks were using luminosity features to do dwarf- 
giant discrimination. 

Network with two hidden layers performed considerably 
better (« 0.2 subtypes) than ones with only one hidden 
layer. The best classification results were achieved by tack- 
ling the spectral type and luminosity class problems sepa- 
rately, using continuous and probabilistic networks respec- 
tively. 

We used Principal Components Analysis to compress 
the spectra by a factor of over 30 while retaining 96% of the 
variance in the data. It was shown that this compression 
predominantly removes noise. In addition the PCA prepro- 
cessing reduces the dimensionality of the data and can be 
used to filter out bogus spectral features or identify unusual 
spectra. However, PCA has the drawback that very weak or 
rare features will not be well-reconstructed. More complex 
non-linear preprocessing schemes could no doubt be devised, 
but the strength of PCA is its analytic simplicity and its ro- 
bustness. 

The automated classifiers presented in this paper have 
been used to produce classifications for several thousand 
stars which do not have classifications listed in the MHD 
catallogue. These will be presented in a future paper (Bailer- 
Jones 1998). 
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